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Abstract 

Background: Representing species-specific proteins and protein complexes in ontologies that are both human- 
and machine-readable facilitates the retrieval, analysis, and interpretation of genome-scale data sets. Although 
existing protin-centric informatics resources provide the biomedical research community with well-curated 
compendia of protein sequence and structure, these resources lack formal ontological representations of the 
relationships among the proteins themselves. The Protein Ontology (PRO) Consortium is filling this informatics 
resource gap by developing ontological representations and relationships among proteins and their variants and 
modified forms. Because proteins are often functional only as members of stable protein complexes, the PRO 
Consortium, in collaboration with existing protein and pathway databases, has launched a new initiative to 
implement logical and consistent representation of protein complexes. 

Description: We describe here how the PRO Consortium is meeting the challenge of representing species-specific 
protein complexes, how protein complex representation in PRO supports annotation of protein complexes and 
comparative biology, and how PRO is being integrated into existing community bioinformatics resources. The PRO 
resource is accessible at http://pir.georgetown.edu/pro/. 

Conclusion: PRO is a unique database resource for species-specific protein complexes. PRO facilitates robust 
annotation of variations in composition and function contexts for protein complexes within and between species. 



Background 

Logical and semantic access to related protein forms is 
critical for advancing bioinformatics approaches to 
representing, modeling, and reasoning about complex 
biological systems at the genomic and cellular level [1]. 
The Protein Ontology (PRO) Consortium develops and 
maintains ontology resources for the representation of 
protein forms for all organisms [2], PRO is one of the 
six inaugural ontologies that form the Open Biological 
and Biomedical Ontologies (OBO) Foundry [3]. 

The PRO has three components informally referred to 
as ProForm, ProEvo, and ProComp [4]. ProForm repre- 
sents species-specific and species-independent classes of 
protein isoforms, co- and post-translationally modified 
forms, and variant forms. ProEvo represents evolution- 
ary relatedness of proteins. ProComp, the focus of this 
manuscript, represents multi-protein complexes, with an 
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initial (but not exclusive) emphasis on protein compo- 
nents of complexes in mouse and human. ProComp, 
uses the Gene Ontology [5] (GO) definition of protein 
complex: "Any macromolecular complex composed of 
two or more polypeptide subunits, which may or may 
not be identical. Protein complexes may have other 
associated non-protein prosthetic groups, such as 
nucleotides, metal ions or other small molecules." 
(GO:0043234). Therefore protein complexes may, in 
addition, have components that are nucleic acids, carbo- 
hydrates, or lipids. Protein complexes are distinguished 
from protein-protein interactions in that they are conti- 
nuant entities, i.e. they endure or continue to exist 
through time. Interactions, in contrast, are occurrent 
entities, i.e. they occur in time through successive tem- 
poral phases. The explicit representation of protein 
complexes in PRO-defining each member of the com- 
plex at the level of its isoform, variant, or modified 
form-provides the ability to represent complex biologi- 
cal knowledge as it is emerging in the experimental 
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research community in structures that are both human 
readable and accessible to algorithmic approaches. 

ProComp leverages, and cross references, entries in 
existing protein-centric informatics resources, including 
the protein complexes that are represented in the Cellu- 
lar Component branch of the Gene Ontology. In the 
GO, types of protein complexes are defined in terms of 
constituent macromolecule classes and the function(s) 
that the complexes carry out. By agreement within the 
protein informatics community, PRO represents the spe- 
cies- specific classes of protein complexes, while GO, in 
most instances, represents the species-independent 
classes of protein complexes; within PRO, the latter are 
referred to by using GO identifiers. The UniProt Knowl- 
edgebase (UniProtKB) [6,7] describes protein complexes, 
though not as separate accessioned entities. References 
to protein records in UniProtKB are made through 
entries in the ProForm sub-ontology within PRO (Figure 
1). A major contribution of PRO as a protein biology 
community informatics resource is that it provides a 
formal ontological structure with foundation in Basic 



Formal Ontology http://www.ifomis.org/bfo/to describe 
types of protein complexes and gives these types unique, 
permanent identifiers http://www.obofoundry.org/id-pol- 
icy.shtml. PRO facilitates functional annotation of pro- 
teins and protein complexes with specific cellular 
contexts; it promotes compatibility with other ontology 
resources; and it promotes use of software tools for rea- 
soning and analysis. 

Uniquely among protein biology community infor- 
matics resources, PRO follows a principles-based strat- 
egy for ensuring ontological adequacy (i.e. biologically 
accurate, logically coherent and computationally useful). 
Some of these principles flow from the PRO's member- 
ship of the OBO (Open Biological and Biomedical 
Ontologies) Foundry. The OBO Foundry requires the 
use of Basic Formal Ontology http://www.ifomis.org/ 
bfo/ as the top-level formal ontological structure; this 
ensures compliance with standard methodologies for 
ensuring formal adequacy such as OntoClean [8]. The 
OBO Foundry also requires the use of a standard formal 
syntax (we maintain versions in both OWL and OBO, 



ProForm 



UniProtKB 



UniProt: 0.9 D 6 R2 

I so c itr ate cl e h y cl r o ge n ase 

[NAD] subunit alpha, 

mitochondrial 

(mouse) 



UniProt:Q91VA7 

I so ■: itr ate cl e h y cl r o ge n ase 

3(NAD+) beta 

(mouse) 



UniProtP70404 

I so ■: itr ate cl e hy cl r o ge n ase 

[NAD] subunit gamma 1. 

mitochondrial 

(mouse) 



PR:000025353 
i so ■: itr ate cl e hy cl r o ge n ase 
[NAD] subunit alpha, 
mitochondrial (mouse) 



PR:000025J55 
i so ■: itr ate cl e hy cl r o ge n ase 
[NAD] subunit alpha, 
mitochondrial isoform 1 

(mouse) 



PR:000025J56 

i so c itr ate cl e hy cl r o ge n ase 

[NAD] subunit alpha. 

mitochondrial isoform 2 

(mouse) 



PR:000025J59 
i so ■: itr ate cl e h y cl r o ge n ase 
[NAD] subunit beta, 
m ito ■: h o n cl r i al (mouse) 



PR:00002SJfiD 
i so citrate dehydrogenase 
[NAD] subunit gamma, 
mitochondrial (mouse) 




Pro Com p 



PR:000026071 
fvl ito ■: h o n cl r i al i so «: itr ate 
dehydrogenase complex 
(NAD+) (mouse) 



PR:00002S072 
mitochondrial isocitrate 
dehydrogenase complex 
(NAD+) A (mouse) 



Gene Ontology 



60:0005962 
mitochondrial isocitrate 
dehydrogenase complex 
(HAD+) 



PR:00002G073 
mitochondrial isocitrate 
dehydrogenase complex 
(NAD+) B (mouse) 



Figure 1 Schematic showing the relationships between UniProtKB, ProForm, GO, and ProComp. Arrows between ProForm and UniProtKB 
are xref, those between ProComp and ProForm are has_port, and between ProComp and Gene Ontology the arrows depict is_a relationships. In 
ProForm each protein form is assigned a unique identifier and is cross-referenced to protein entries in UniProtKB. Protein isoforms and modified 
forms are described in UniProtKB records, but in contrast to PRO, each protein form is not represented as a separate, uniquely accessioned entity 
in UniProtKB. For example, for the alpha subunit of IDH in mouse there is a PRO entry for the protein (PR:000025358) and for each of the alpha 
protein isoforms (PR:000025355 and PR:000025356). In UniProt the IDH alpha subunit and its isoforms are all represented in the same record 
(UniProt: Q9D6R2). In ProComp, accessioned, species-specific protein complex entities are described using protein entries from ProForm. The 
protein complexes in ProComp are cross-referenced to species-independent complex representations in the Gene Ontology (GO). 
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thus allowing not only use of logical reasoners for con- 
sistency checking, but also providing an added layer of 
quality assurance by requiring the two versions to satisfy 
the formal principles needed for the two-way OBO-to- 
OWL mapping tools to run effectively). In addition, the 
OBO Foundry requires that an ontology is maintained 
and regularly updated to keep pace with scientific 
advances, rectify errors, and fill gaps identified by its 
users. Other principles to ensure ontological adequacy 
apply specifically to the PRO and to its constituent 
branches, including ProComp. 

The development of ProComp, in particular, relies on 
high-level scientific content deriving from research on 
cellular contexts, diseases and, model organisms in addi- 
tion to protein biology and chemistry. This requirement 
is addressed by a final OBO Foundry principle, which 
requires that each ontology is maintained in such a way 
as to involve the developers and users of other neigh- 
boring ontologies (e.g., GO, ProForm, ProEvo, etc.) in 
order to ensure consistency of scientific content. The 
ongoing process of critical review by representatives of 
multiple complementary disciplines is designed to guar- 
antee that the ontology is developed on the basis of the 
best current scientific understanding of relevant subject- 
matters in each of these disciplines. 

In this communication we present elements of PRO's 
current approach for representing species-specific types 
of protein complex, illustrate how they are being inte- 
grated with existing pathway database resources, and 
describe how the ProComp effort facilitates comparative 
protein biology and functional genomics. 

Construction and Content 
Representation of protein complexes 

The Protein Ontology is authored using the OBO 1.2 
format [9]. The OBO format supports algorithmic rea- 
soning and is also human readable. Terms in an OBO 
ontology are described by "stanzas." Each OBO stanza 
has an identifier, term name, and a textual term defini- 
tion; the stanzas may also contain comments and syno- 
nyms. A series of relationship declarations are used to 
construct a logical definition. The declarations relate the 
term to other terms within the PRO ontology as well as 
to other ontologies, such as the GO (Figure 1). 

The protein complex types in PRO are defined by 
means of multiple part relationships to PRO-defined spe- 
cies-specific protein types. The species-specific protein 
types can be of isoforms or modified isoforms. Each PRO 
complex stanza includes an is_a (i.e. subtype) relation- 
ship to the corresponding Gene Ontology (GO) protein 
complex term (Figure 1). Cross references to protein 
entries in UniProtKB are not included in protein complex 
stanzas. Instead they are part of the stanzas that repre- 
sent the component proteins in ProForm (Figure 1). 



The initial development of ProComp makes use of 
examples of known protein complexes drawn from well- 
established, manually-curated pathway databases: Eco- 
Cyc [10], MouseCyc [11] and Reactome [12,13]. The 
close collaboration of PRO with these databases ensures 
that PRO allocates effort where there is demonstrated 
need, and that PRO's definitions are accessible to the 
biomedical research community in the functional con- 
text of cellular pathways. EcoCyc http://ecocyc.org 
represents transcriptional regulation, transporters, and 
biochemical pathways for Escherichia coli. MouseCyc 
http://mousecyc.jax.org focuses on pathways involved in 
biosynthesis, degradation, energy production, and detox- 
ification. MouseCyc is unique among mammalian path- 
way resources in that the database is connected with the 
rich biological knowledge about mouse gene function 
and phenotypes contained in the Mouse Genome Infor- 
matics database (MGI; http://www.informatics.jax.org) 
[14]. Reactome is a manually curated knowledgebase of 
human biological pathways http://www.reactome.org. In 
Reactome, pathways are represented as series of molecu- 
lar events that transform one or more input physical 
entities into one or more output entities, catalyzed or 
regulated by other entities. Entities include small mole- 
cules, proteins, post-translationally modified proteins, 
and complexes. Current priorities for ProComp are to 
create entries for all protein complexes in the contribut- 
ing databases with an emphasis first on those with pub- 
lished species-specific experimental evidence and those 
with human disease relevance. Although ProComp is 
being populated initially from a few targeted database 
resources, information about protein complexes will also 
be obtained from other database resources such as 
IntAct [15] and the primary published literature. 

Utility and Discussion 

Below we describe four use cases to highlight key ele- 
ments in the representation of protein complexes in 
PRO and to illustrate how aspects of biological knowl- 
edge and complexity are being captured in ProComp. 
Use Case 7. The 3-methylcrotonyl carboxylase (MCC) 
Complex: A heterodimeric complex 

The MCC protein complex (E.C. 6.4.1.4) is a mitochon- 
drial, biotin-dependent heterodimeric enzyme consisting 
of alpha and beta subunits. The PRO stanza for the 
murine MCC complex is illustrated below with key 
points illustrated by bold text. The ID for this protein 
complex type (PR:000025760 = http://purl.obolibrary. 
org/obo/PR_000025760) is unique, permanent, and is 
resolvable on the web to useful information. This class 
of complex is a subclass of the class of species-nonspeci- 
fic MCC protein complexes defined in the Gene Ontol- 
ogy (GO:0002169) and this relationship is explicitly 
represented using an is_a statement in the stanza. The 
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species specificity of the protein complex is indicated by 
the value of the only_in_taxon tag. The MCC complex 
has two components, an alpha and beta subunit repre- 
sented by the has_part relationships that assert that 
each instance of the complex contains an instance of 
indicated peptide. The embedded cardinality declaration 
indicates the number of instances of that type of subunit 
in the complex. In the case of murine MCC, the cardin- 
ality declarations indicate that each complex has exactly 
one of each subunit (alpha-beta). The absence of a car- 
dinality declaration indicates there is at least one copy 
of each subunit. 

We deliberately use has_part to relate the complex to 
the subunit rather than part_of to relate the subunit to 
the complex. The latter protein-centric statement would 
indicate that each instance of the peptide is part of 
some instances of the indicated complex, which is not 
necessarily the case, as the peptide could exist in a free 
form or as part of other complexes. Finally, that this is a 
protein complex only found in mouse is represented by 
tag only _in_t axon paired with the official taxon identi- 
fier. The totality of these relationship statements makes 
up the logical definition for this complex type. 

[Term] 

id: PR:000025760 

name: methylcrotonoyl-CoA carboxylase complex, 
mitochondrial (mouse) 

def: "A methylcrotonyl-CoA carboxylase complex, 
mitochondrial, whose components are encoded in the 
genome of mouse." [PRO:CJB] 

comment: Category = organism-complex. Entities of 
this type are disposed to have the enzymatic activity 
described by EG6.4.1.4. 

synonym: "beta-methylcrotonyl-CoA carboxylase 
(mouse)" EXACT [] 

synonym: "MCC (mouse)" EXACT [] 

is_a: GO:0002169! 3-methylcrotonyl-CoA carboxy- 
lase complex, mitochondrial 

relationship: has_part PR:000025354 {cardinality = 
"1"} ! methylcrotonoyl-CoA carboxylase subunit alpha, 
mitochondrial (mouse) 

relationship: has_part PR:000025357 {cardinality = 
"1"} ! methylcrotonoyl-CoA carboxylase beta chain, 
mitochondrial (mouse) 

relationship: only_in_taxon taxon: 10090 ! Mus 
musculus 

Use Case 2. The serine palmitoyltransferase (SPT) Complex: 
Differences in prokaryotic and eukaryotic protein complex 
organization 

Serine palmitoyltransferase (SPT; EC 2.3.1.50) catalyzes 
the key reaction in the biosynthesis of sphingolipids. In 
many eukaryotic species, this enzyme is a heterodimer 
consisting of two subunits, SPTLC1 and SPTLC2. 
Recently, it was shown that in human cells, SPTLC1 



and SPTLC2 form a complex with a third subunit, 
SPTLC3, with a resulting molecular mass of 480 kDa 
[16]. SPTLC1 is a common subunit to two core com- 
plexes SPTLC1-SPTLC2 and SPTLC1- SPTLC3, and it 
is likely that the ratio of SPTLC2 to SPTLC3 subunits 
in the SPT complexes, as well as binding of some smal- 
ler regulatory subunits, confers preferential activity of 
SPT complexes to specific acyl-CoA substrates [17,18]. 

A serine palmitoyltransferase complex is also found in 
gram-negative sphingolipid-containing bacteria. Unu- 
sually, the outer membranes of these bacteria contain 
glycosphingolipid (GSL) instead of lipopolysaccharide, 
and SPT catalyzes the first step of the GSL biosynthetic 
pathway in these organisms. But, as opposed to the 
human SPT complex, bacterial SPT complex is homodi- 
meric [19-21]. In some bacterial species the complex is 
water-soluble, whereas in eukaryotes the complex is 
membrane-bound. The component subunits of bacterial 
and eukaryotic SPT complexes are evolutionarily-related, 
and this relationship is evidenced in the ontology (Fig- 
ure 2; both the bacterial and eukaryotic serine palmi- 
toyltransferases trace back to the same progenitor class 
of pyridoxal 5'-phosphate (PLP)-dependent alpha-oxoa- 
mine synthase). 

The representation of SPT complexes in PRO shown 
below and illustrated in Figure 2 highlights the differences 
in composition among human complexes and between the 
human and bacterial complexes. Stanzas for two human 
SPT complexes are depicted below. The core complex is 
represented by PR:000026144 while PR:000026145 repre- 
sents the multimeric functional complex. The complex 
composition is formalized using the has_part relation to 
the corresponding classes of protein components. The 
functional human complex classes have an is_a relation- 
ship to the serine C-palmitoyltransferase complex class in 
GO (GO:0017059). The complex core has an is_a relation- 
ship to GO:0043234 (protein complex) as there is cur- 
rently not a more specific complex class in GO. 

[Term] 

id: PR:000026144 

name: serine palmitoyltransferase complex core 1 
(human) 

def: "A serine palmitoyltransferase complex that is 
heterodimeric consisting of one subunit of serine palmi- 
toyltransferase 1 and serine palmitoyltransferase 2. 
These components are encoded in the genome of 
human." [PRO:CNA] 

is_a: GO:0043234 ! protein complex 

relationship: has_part PR:000026141 {cardinality = 
"1"} ! serine palmitoyltransferase 1 (human) 

relationship: has_part PR:000026142 {cardinality = 
"1"} ! serine palmitoyltransferase 2 (human) 

relationship: only_in_taxon taxon:9606 ! Homo sapiens 

[Term] 
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bacterial serine 
palmitoyltransferase 

PR:000026127 




endoplasmic reticulum 
palmitoyltransferase complex 

GO .003 1211 



eukaryotic serine 
palmitoyltransferase 

PR.000026126 



homodimeric serine 
palmitoyltransferase complex 



bacterial serine 
palmitoyltransferase isoform 1 

PR .000026 128 



serine C-palmitoyltransferase 
complex 

GO.0017059 



serine palmitoyltransferase 
isoform 1 (Sphingobacterium 
multivorum) 

PR.000026168 




serine palmitoyltransferase 1 
(human) 

PR.000026141 



serine palmitoyltransferase 2 
(human) 

PR.000026142 



serine palmitoyltransferase 3 
(human) 

PR. 000026 143 



bacterial serine 
palmitoyltransferase complex 
(Sphingobacterium multivorum) 

PR. 000026 169 



serine palmitoyltransferase 
complex core 1 (human) 

PR 000026144 



serine palmitoyltransferase 
complex core 2 (human) 

PR. 000026 145 



Figure 2 PRO hierarchy depicting relationship for the human and bacterial SPT complex protein subunits, and for the SPT complexes. 

The blue arrows and "\" icons represent is_a relationships among the entities. The protein components have PR ids and the complex concepts 
have Gene Ontology (GO) ids. 



id: PR:000026155 

name: serine palmitoyltransferase complex A (human) 

def: "A serine palmitoyltransferase complex consisting 
of an unknown combination of the serine 

palmitoyltransferase subunits 1-3. The stoichiometry is 
a tetramer composed of the serine 

palmitoyltransferase core complexes 1 and/or 2." 
[PMID:17331073, PRO:CNA] 

comment: Category = organism-complex. Entities of 
this type are disposed to have the enzymatic activity 
described by EG2.3.1.50. 

is_a: GO:0017059 ! serine C-palmitoyltransferase 
complex 

relationship: has_part PR:000026141 {cardinality = 
"4"}! serine palmitoyltransferase 1 (human) 

relationship: has_part PR:000026142 ! serine palmi- 
toyltransferase 2 (human) 

relationship: has_part PR:000026143 ! serine palmi- 
toyltransferase 3 (human) 

relationship: only_in_taxon taxon:9606 ! Homo sapiens 



The protein complex stanza for the bacterial SPT 
complex (PR:000026169) is shown below. The bacterial 
and human complexes differ in composition (Figure 2) 
but perform closely similar biochemical functions. 

[Term] 

id: PR:000026169 

name: bacterial serine palmitoyltransferase complex 
(Sphingobacterium multivorum) 

def: "A homodimeric serine palmitoyltransferase complex 
that is composed of bacterial serine palmitoyltransferase 
encoded in the genome of Sphingobacterium multivorum." 
[PRO:CNA, PMID:17557831, PMID: 19564159] 

comment: Category = organism-complex. Entities of 
this type are disposed to have the enzymatic activity 
described by EG2.3.1.50. 

is_a: GO:0002179 ! homodimeric serine palmitoyl- 
transferase complex 

relationship: has_part PR:000026168 {cardinality = "2"} 
! bacterial serine palmitoyltransferase isoform 1 (Sphin- 
gobacterium multivorum) 
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relationship: only_in_taxon taxon:28454 ! Sphingobac- 
terium multivorum 

Use Case 3: Mitochondrial Isocitrate Dehydrogenase 
Complex; A heterotrimeric complex using different isoforms 
for one subunit 

There are three types of isocitrate dehydrogenases 
(IDHs) in mammals; two IDHs utilize NADP + as a 
cofactor in the oxidative decarboxylation of isocitrate 
and one utilizes NAD + . The NAD + -dependent IDH is a 
mitochondrial protein complex in the citric acid cycle; it 
consists of three subunits (alpha, beta, gamma), each of 
which is encoded by a single gene. The mouse alpha 
subunit gene encodes two different protein isoforms by 
alternative splicing. ProComp can distinguish between 
complexes that differ in the isoform of one of the con- 
stituents. As illustrated below (in bold), the protein 
complex stanzas from PRO describe two types of IDH 
complexes that differ only in the type of isoform of the 
alpha subunit. Both classes of isoform-specific protein 
complexes shown below (PR:000026072 and 
PR:000026073) are subclasses of the mouse-specific pro- 
tein complex class (PR:000026071) which, in turn, is a 
subclass of the species-independent complex class 
defined in GO (GO:0005962). 
[Term] 

id: PR:000026071 

name: mitochondrial isocitrate dehydrogenase com- 
plex (NAD+) (mouse) 

definition: A mitochondrial isocitrate dehydrogenase 
complex using NAD+ whose components are encoded 
in the genome of mouse [PRO:hjd] 

is_a:GO:0005962 ! mitochondrial isocitrate dehydro- 
genase complex (NAD+) 

relationship: has_part PR:000025358 ! isocitrate 
dehydrogenase [NAD] subunit alpha, mitochondrial 
(mouse) 

relationship: has_part PR:000025359 ! isocitrate dehy- 
drogenase [NAD] subunit beta, mitochondrial (mouse) 

relationship: has_part PR:000025360 ! isocitrate dehy- 
drogenase [NAD] subunit gamma, mitochondrial 
(mouse) 

relationship: only_in_taxon taxon:10090! Mus 
musculus 
[Term] 

id: PR:000026072 

name: mitochondrial isocitrate dehydrogenase com- 
plex (NAD+) A (mouse) 

def: A mitochondrial isocitrate dehydrogenase com- 
plex using NAD+ whose components are encoded in 
the genome of mouse containing isoform 1 of subunit 
alpha" [PRO:hjd] 

is_a:PR: 000026071! mitochondrial isocitrate dehydro- 
genase complex (NAD+) (mouse) 



relationship: has_part PR:000025355 ! isocitrate 
dehydrogenase [NAD] subunit alpha, mitochondrial 
isoform 1 (mouse) 

relationship: has_part PR:000025359 ! isocitrate dehy- 
drogenase [NAD] subunit beta, mitochondrial (mouse) 

relationship: has_part PR:000025360 ! isocitrate dehy- 
drogenase [NAD] subunit gamma, mitochondrial 
(mouse) 

relationship: only_in_taxon taxon:10090 ! Mus 
musculus 
[Term] 

id: PR:000026073 

name:mitochondrial isocitrate dehydrogenase complex 
(NAD+) B (mouse) 

def: A mitochondrial isocitrate dehydrogenase com- 
plex using NAD+ whose components are encoded in 
the genome of mouse containing isoform 2 of subunit 
alpha" [PRO:hjd] 

is_a:PR: 000026071! mitochondrial isocitrate dehydro- 
genase complex (NAD+) (mouse) 

relationship: has_part PR:000025356 ! isocitrate 
dehydrogenase [NAD] subunit alpha, mitochondrial 
isoform 2 (mouse) 

relationship: has_part PR:000025359 ! isocitrate dehy- 
drogenase [NAD] subunit beta, mitochondrial (mouse) 

relationship: has_part PR:000025360 ! isocitrate dehy- 
drogenase [NAD] subunit gamma, mitochondrial 
(mouse) 

relationship: only_in_taxon taxon:10090! Mus 
musculus 

Use Case 4. The Respiratory Chain IV Complex: 
Representing uncertainty in complex components 

Cytochrome c oxidase, also known as Complex IV (EC 
1.9.3.1), is a large transmembrane protein complex 
located in the inner mitochondrial membrane. Its func- 
tion is sequestration of free radicals, coupled to the 
transport of protons across the inner mitochondrial 
membrane, which creates the proton gradient used by 
mitochondrial ATP synthase during ATP production. 
Although the function of Complex IV is well under- 
stood, the exact composition of the complex is not cer- 
tain due to apparent genetic redundancy. Complex IV is 
composed of 13 polypeptides: three encoded in the 
mitochondrial genome, and ten encoded in the nuclear 
genome [22]. The uncertainty for this complex stems 
from the fact that some components of the complex are 
known (or predicted) to be encoded by multiple genes 
in the nuclear genome. The PRO stanza for Complex IV 
shown below illustrates how has_part statements that 
refer to "either-or" classes of proteins are used to repre- 
sent uncertainty in a compact form with no loss of 
information. The union_of statement (highlighted by 
bold text in the stanza below) indicates which genes 
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could code for a particular protein component. The 
union_oft&% appears in the appropriate protein compo- 
nent stanzas, not in the protein complex stanza. These 
tags are shown below (as comments) for illustration 
purposes only. 
[Term] 

id: PR:000026295 

name: respiratory chain complex IV (mouse) 
def: A mitochondrial respiratory chain complex IV 
whose components are encoded in the genome of 
mouse. [PRO:CJB, PMID:8638158, PMID:21211513] 

is_a: GO:0005751 ! mitochondrial respiratory chain 
complex IV 

relationship: has_part PR:000025364 {cardinality = "1"} 
! cytochrome oxidase subunit 1, mitochondrial (mouse) 

relationship: has_part PR:000025366 {cardinality = "1"} 
! cytochrome oxidase subunit 2, mitochondrial (mouse) 

relationship: has_part PR:000025367 {cardinality = "1"} 
! cytochrome oxidase subunit 3, mitochondrial (mouse) 

relationship: has_part PR:000026286 complex IV 
component 4 (mouse) 

! union_of: PR:000025368 ! Cox4il 
! union_of: PR:000025369 ! Cox4i2 

relationship: has_part PR:000026294 {cardinality = "1"} 
! cytochrome c oxidase subunit 5A, mitochondrial, tran- 
sit peptide removed form (mouse) 

relationship: has_part PR:000025371 {cardinality = "1"} 
! cytochrome c oxidase subunit 5B (mouse) 

relationship: has_part PR:000026287 {cardinality = "1"} 
! complex IV component 6a (mouse) 

! union_of: PR:000025372 ! Cox6al 
! union_of: PR:000025373 ! Cox6a2 

relationship: has_part PR:000026288 {cardinality = "1"} 
! complex IV component 6b (mouse) 

! union_of: PR:000025374 ! Cox6bl 
! union_of: PR:000025375 ! Cox6b2 

relationship: has_part PR:000025376 {cardinality = "1"} 
! cytochrome c oxidase subunit 6C (mouse) 

relationship: has_part PR:000026289 {cardinality = "1"} 
! complex IV component 7A (mouse) 

! union_of: PR:000025377 ! Cox7al 
! union_of: PR:000025378 ! Cox7a2 
! union_of: PR:000025379 ! Cox7a21 

relationship: has_part PR:000027496 {cardinality = "1"} 
! complex IV component 7B (mouse) 



! union of PR:000027491 ! Cox7b 
! union of PR:000027489 ! Cox7b2 

relationship: has_part PR:000025383 {cardinality = "1"} 
! cytochrome c oxidase subunit 7C, mitochondrial 
(mouse) 

relationship: has_part PR:000026290 {cardinality = "1"} 
! complex IV component 8 (mouse) 

! union_of: PR:000025380 ! Cox8a 
! union_of: PR:000025381 ! Cox8b 
! union_of: PR:000025382 ! Cox8c 

relationship: only_in_taxon taxon:10090 ! Mus 
musculus 

While not illustrated in the use cases above, many 
PRO complexes have species-specific protein types of 
modified isoforms. A few examples are: (i) 
PR:000025933 smad2-smad4 protein complex 1 
(human), which contains active phosphorylated form of 
smad2 (MAD homolog 2), (ii) PR:000026035 myc-max 
acetylated complex (human), where both myc (myelocy- 
tomatosis oncogene) and max (Max protein) are acety- 
lated, and (iii) PR:000027084 IRF3-P:IRF7-P complex 
(human), which contains the active phosphorylated 
forms of IRF3 (interferon regulatory factor 3) and IRF7 
(interferon regulatory factor 7). 
Viewing Protein Complex Data in PRO 
The web display of the ProComp stanza for the human 
SPT complex described above in Use Case 2 http://purl. 
obolibrary.org/obo/PR_000026144 is shown in Figure 3. 
The web page provides human-readable access to 
entries in ProComp as well as navigation to detailed 
records about the complex subunits and related nodes 
in PRO. 

Annotating Protein Complexes 

The ProComp ontology provides species-specific, acces- 
sioned protein complex entities that support robust, 
context-specific annotation of protein biology [23]. 
Comprehensive functional annotation of proteins and 
protein complexes is not a mission of the PRO consor- 
tium; however, by providing uniquely accessioned pro- 
tein complex entities PRO provides a critical and 
necessary resource to support unambiguous sharing of 
annotations by specialists across multiple biological dis- 
ciplines. To support data exchange and integration by 
model organism and pathway database groups that are 
annotating protein complexes, biological annotations of 
proteins and protein complexes can be represented in 
Protein Annotation Files (PAF). PAFs are modeled on 
the Gene Annotation Files (GAFs) generated for GO 
annotations http://www.geneontology.org/GO.format. 
gaf-l_0.shtml. The PAF format consists of a standard 
header and 20 tab-delimited columns (11 required; 9 
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Figure 3 Screenshot of the PRO entry for human serine palmitoyltransferase complex core 1 (SPT) as displayed on the PRO web site. 
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optional; see ftp://ftp.pir.georgetown.edu/databases/ 
ontology/pro_obo/ PAF_guidelines.pdf). 

Figure 4 illustrates how a PAF is used to provide bio- 
logical context for proteins and protein complexes for 
the MCC complex described in Use Case 1 above. 3- 
Methylcrotonylglycinuria is an inborn error of leucine 
catabolism caused by deficiency of 3-methylcrotonyl- 
Co A carboxylase activity (MCC; EC 6.4.1.4). Knowledge 
that mutations in human genes that encode the protein 
components of the MCC complex: MCCC1 (subunit 
alpha) and MCCC2 (subunit beta), cause methylcroto- 
nylglycinuria type I (OMIM #210200) and methylcroto- 
nylglycinuria type II (OMIM #210210), respectively, are 
represented in the PAF file (ftp://ftp.pir.georgetown.edu/ 
databases/ontology/pro_obo/, and Figure 4). As shown, 



each variant of MCCC1 and MCCC2 has its own PRO 
ID and annotation. Annotations of variants may include 
sequence-related attributes (annotated with Sequence 
Ontology [24] terms; SO), relation to disease (annotated 
with MIM [25]), and/or functional information (anno- 
tated with GO). In this particular case the SO annota- 
tion provides important information regarding the 
sequence nature of the variant. Two of the variants in 
Figure 4 are generated because of missense mutations 
that, while within exons, alter RNA splicing either by 
removing a splice site (MCCAD532H, PR:000026111), 
or by activating a cryptic splice donor site (MCCB 
I437V, PR:000026121) [26]. In addition, the PAF makes 
it possible to indicate sequence variants which alter a 
biological process or function, as in MCCA L437P. 
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Figure 4 Example of a Protein Annotation File (PAF) showing biological annotations for MCCC1 (MCCA) and MCCC2 (MCCB) human 
variants. Only columns with information are shown and short names of the variants are used. This PAF file is available via the PRO ftp site: ftp:// 
ftp.pir.georgetown.edu/databases/ontology/pro_obo/. 
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Conclusions 

The uniquely identified, species-specific protein complex 
classes in ProComp serve as a framework for formal 
representation of these biological entities and their bio- 
logical contexts. For example, Na+-K+-ATPases, or 
"sodium pumps" create both an electrical and chemical 
gradient across the plasma membrane. Sodium pumps 
are usually tetramers consisting of two catalytic alpha, 
and two membrane-bound beta subunits. However, in 
high osmolarity environments such as kidney epithe- 
lium, a third (gamma) subunit is required for their 
proper functioning [27,28]. In mammals, each of the 
subunits is encoded by members of a multi-gene family 
and different subunit combinations likely provide tissue 
and developmental specificities of the entities exercising 
sodium pump functions. As illustrated in this manu- 
script, the design of the ProComp ontology supports the 
representation of the compositional complexity of pro- 
tein complexes and, combined with information from 
manually-curated databases such as Reactome and Mou- 
seCyc, facilitates robust annotation of rich biological 
contexts as well. Future directions for data acquisition 
and curation within the ProComp project include work- 
ing with a wide range of community bio-curation 
resources and investigators to create protein complex 
records that support curation of experimental annota- 
tions. We are developing software to convert PRO-rele- 
vant information from database resources into an 
ontological format. We are also extending an existing 
web-based data entry tool http://pir.georgetown.edu/cgi- 
bin/pro/race_pro that will support direct submission of 
protein complexes to PRO. The representation of pro- 
tein complexes classes in PRO will be revisited fre- 
quently as additional information and evidence about 
their composition and stochiometry is identified in the 
scientific literature by PRO curators. 

Improvements in the PRO ontology and annotation 
have come to light by our work on the use cases 
described in this manuscript. We outline a number of 
known issues that serve as the basis for future work. 
First, while the assertions in stanzas within PRO are 
intended to be backed by evidence, the distinction 
between experimental and inferential evidence (both 
legitimate) is not yet captured clearly. This is especially 
true in cases where the evidence for composition and 
stochiometry for protein complexes is derived from 
orthologous proteins and complexes in other organisms. 
Usefully recording such information, as we recognize is 
important, was hampered by deficits in our curation 
tools that made it impossible to record evidence for 
individual relations or, more specifically, cardinality. We 
are working with the Ontology for Biomedical Investiga- 
tions (OBI; http://obi-ontology.org/) group on the devel- 
opment and application of an evidence code ontology to 



describe the evidence used to support protein complex 
assertions in ProComp, and with developers of the cura- 
tion tools to enable capturing such evidence with appro- 
priate granularity. Second, our current work records 
only protein monomer components of complexes facili- 
tated by an interface that makes selection of the mono- 
mers and their identifiers manageable. However, as we 
note, protein complexes (such as Complex IV in Use 
Case 4) are known to have lipid, heme, and other co- 
factors as components. Extensions to our curation tools 
will be developed in order to select and include such 
components, identifiers for which are provided by, e.g., 
the Chemical Entities of Biological Interest (CheBI) 
ontology [29]. Third, there is further work to be done 
on the relations that connect protein complex and gene 
product mutation variants to disease. We have used 
has_agent and is_agent_in relations in our annotations 
to connect, provisionally, proteins with representation of 
sequence variation from the Sequence Ontology. How- 
ever this does not agree with the formal definition in 
which these relations connect processes with their parti- 
cipants [30]. We plan to work with developers of the 
Relation Ontology to define appropriate relations. 
Fourth, we will work to improve the axioms we use to 
better enable use of reasoners to check consistency and 
facilitate advanced query. For example, ProComp cur- 
rently uses has_part to relate a protein complex to its 
components and uses cardinality restriction to represent 
stoichiometry. However, because has_part is a transitive 
relation, the ontology falls outside OWL2-DL making it 
not possible to use standard reasoners. Thus, we are 
evaluating the possibility of replacing has_part with a 
new relation that is a non-transitive sub-property of the 
has_part relation. Finally, we will implement closure 
axioms that would more clearly define the number and 
kind of complex components. For example, our repre- 
sentation of Complex IV allows that there might be 
additional protein components because of the open 
world assumption. A closure axiom would assert that 
exactly 13 of the components of this complex are pro- 
teins, while still allowing that there might be other 
kinds of components. 

Availability and requirements 

The Protein Ontology resource can be accessed on-line 
http://pir.georgetown.edu/pro/. Users can search PRO 
using accession IDs (e.g. PRO, UniProtKB, GO) or text 
(e.g., term name, definitions, synonyms). Searches can 
be restricted to specific modified forms (phosphorylated, 
etc.), database, or membership in a complex. PRO 
entries can also be accessed via hypertext links on gene 
detail pages in the Mouse Genome Informatics (MGI) 
database and from MouseCyc, Reactome, and EcoCyc. 
PRO (Release 20) contains 168 complexes drawn from 
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Reactome (human), MouseCyc (mouse), EcoCyc (E. coli) 
and direct submissions from collaborating research 
groups. OBO Edit, an open source ontology editing tool 
http://oboedit.org/[31], can be used to view all of the 
relationships represented in protein complex stanzas 
from ProComp. 

Existing conversion tools allow for OBO-formatted 
ontologies to be converted into Web ontology language 
(OWL) [32,33]. The PRO ontology is available in a 
number of formats, including OWL (latest version 
always at http://purl.obolibrary.org/obo/pr.owl). 
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